All outputs will be saved to out/221202_A01366_0326_AHHTTWDMXY/hg38/SHE5052A9_S101/all-well/DGE_filtered/unintegrated/.

Load data and filter genes

Load Parse Biosciences split-pipe pipeline output as a Seurat object with no cut-offs, remove any missing samples, add sample metadata, add summary statistics, and optionally subset samples. Then filter nuclei per gene.

Gene-level filtering is also performed at this point, based on the min_nuclei_per_gene argument. Genes that are present in very few nuclei of the dataset (e.g., <3) are uninformative and unlikely to help in differentiating between groups of nuclei. In general, most genes removed by this filtering step will be those not detected in any nuclei, which will help to trim the size of the object.

The minimum number of nuclei per gene is set to 5.

# seu <- load_parse_to_seurat(
#     parse_dir,
#     experiment,
#     genome,
#     sublibrary,
#     parse_analysis_subdir,
#     min_nFeature_RNA = 0,
#     min_nuclei_per_gene = min_nuclei_per_gene,
#     sample_subset,
#     remove_na_samples = T,
#     do_add_sample_metadata = T,
#     do_add_summary_stats = T,
#     groupings
#   ) 
# # save seu 
# saveRDS(seu, file = here::here(out$base, "seu.rds"))
seu <- readRDS(here::here(out$base, "seu.rds"))
seu <- seu[sample(nrow(seu), nrow(seu)*0.5),sample(ncol(seu), ncol(seu)*0.1)]

Sample-level overview

Nuclei per sample

Plot of summary statistics by sample

Filter nuclei

Nucleus-level filtering is performed, applying cut-offs to nCount_RNA, nFeature_RNA, percent_mito, and doublet values of each nucleus.

Annotate doublets

Putative doublets are identified using the scDblFinder package. This package creates artificial doublets by grouping together random pairs of nuclei within each sample to create pseudo-doublets, and then identifying nuclei that cluster near to these pseudo-doublets during dimensionality reduction. This co-clustering indicates that they carry a doublet signature.

The remove_doublets argument is set to FALSE.

A dummy column will be introduced, treating all nuclei as though they are singlets.

seu$doublet <- 0

Annotate transcript types

Proportions of transcript types of interest (mitochondrial, ribosomal, globin) are annotated. Mitochondrial genes are particularly of interest for filtering, but it is good practice to inspect the others as well.

  • percent_mito - A high percentage of mitochondrial genes indicates death, loss of cytoplasmic RNA, or increased apoptosis.
  • percent_ribo - mRNA that code for ribosomal proteins. They do not point to specific issues, but it can be good to have a look at their relative abundance. They can have biological relevance.
  • percent_globin - Globin genes are very abundant in erythrocytes. Depending on your application, you can expect ‘contamination’ of erythrocytes and select against it.
seu <- annotate_proportions_of_transcript_types(seu)

Define nucleus-level filters

The do_filtering argument is set to TRUE. Filters will be applied.

seu <- get_filters(
  seu,
  do_filtering,
  remove_doublets,
  min_nCount_RNA,
  max_nCount_RNA,
  min_nFeature_RNA,
  max_nFeature_RNA,
  max_percent_mito
)
  • doublet min = 0
  • doublet max = 1
  • nCount_RNA min = 0
  • nCount_RNA max = 15279.9426
  • nFeature_RNA min = 100
  • nFeature_RNA max = 3938.7214
  • percent_mito min = 0
  • percent_mito max = 0.431253127554009

Plot of filters vs feature distribution

Plot of the proportions of filtered nuclei and their fail criteria for each sample

Plot of filters against distribution of nuclei

Apply nucleus-level filters

seu <- subset(seu, cells = unique(dplyr::filter(seu@misc$nucleus_filtering, pass)$nucleus))

# save seu_filtered
saveRDS(seu, here::here(out$base, "seu_filtered.rds"))

100% (168/168) of nuclei retained.

Highest variable genes

Find the most highly variable genes (HVGs) across samples.

seu <- Seurat::FindVariableFeatures(seu)

Plot of HVGs

Heatmap of top 20 HVGs

Integration (optional)

Normalisation and scaling

Seurat::SCTransform() is a function that performs normalisation and scaling of the data.

Variables to regress out of the SCTransform residuals (argument vars_to_regress) are percent_mito and nCount_RNA. These variables are prevented from contributing much to the PCA during dimensionality reduction, and thus confounding the analysis due to irrelevant variation.

seu <- Seurat::SCTransform(seu, vars.to.regress = vars_to_regress)
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 6889 by 168
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 168 cells
## Found 37 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 6889 genes
## Computing corrected count matrix for 6889 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 4.365217 secs
## Determine variable features
## Place corrected count matrix in counts slot
## Regressing out percent_mito, nCount_RNA
## Centering data matrix
## Set default assay to SCT
# save seu_transformed
saveRDS(seu, file = here::here(out$base, "seu_transformed.rds"))

Annotate cell cycle phases

Nuclei are assigned a score representing the likelihood that they are in each phase of the cell cycle. This is based on the expression levels of genes associated with each phase. The highest probability phase of each nucleus is stored in a new “Phase” column of the metadata.

seu <- Seurat::CellCycleScoring(
  seu,
  s.features = Seurat::cc.genes$s.genes,
  g2m.features = Seurat::cc.genes$g2m.genes
)
## Warning: The following features are not present in the object: PCNA, FEN1,
## MCM2, RRM1, GINS2, CDCA7, MLF1IP, RPA2, NASP, GMNN, CCNE2, UBR7, POLD3, RAD51,
## RRM2, CDC45, CDC6, EXO1, TIPIN, DSCC1, CASP8AP2, USP1, CLSPN, POLA1, CHAF1B,
## E2F8, not searching for symbol synonyms
## Warning: The following features are not present in the object: HMGB2, CDK1,
## NUSAP1, UBE2C, BIRC5, TPX2, TOP2A, NDC80, CKS2, CKS1B, MKI67, TMPO, FAM64A,
## SMC4, CCNB2, CKAP2, AURKB, KIF11, ANP32E, GTSE1, KIF20B, CDCA3, HN1, CDC20,
## TTK, CDC25C, NCAPD2, DLGAP5, CDCA2, HMMR, PSRC1, LBR, CKAP5, NEK2, G2E3, CBX5,
## not searching for symbol synonyms

Dimensionality reduction

Dimensionality reduction is performed to isolated the strongest signals and to remove background noise.

Linear dimensionality reduction (PCA)

seu <- Seurat::RunPCA(seu)
## PC_ 1 
## Positive:  FKBP5, RORA, PDK4, ADAMTS9-AS2, PRUNE2, ACSL1, RNLS, CYP3A5, SULT1C2, BCL2 
##     BNIP3L, RNF152, ERRFI1, GPX3, ANK3, MYOCOS, RETREG1, AGBL4, ACSS3, PDE7A 
##     PKHD1, LINC02532, REPS2, LINC00621, SPAG17, SLC9A9, ENSG00000225689, SPSB1, PRKN, FBXO25 
## Negative:  DCBLD2, MACF1, CDK6, PRKCA, CHST11, IGFN1, LAMB1, MAP1B, RUNX2, KTN1 
##     SRGAP1, ZSWIM6, CASK, FRMD5, NRP1, WWC1, RGS20, AGPAT4, TRIO, MIR4435-2HG 
##     ITGA3, DGKH, APP, NEK6, PCED1B, ME1, CARMIL1, ZMIZ1, KIF26B, HSPG2 
## PC_ 2 
## Positive:  ZNF395, SMIM2-AS1, TRABD2B, GPI, PGGHG, CDH4, PITPNC1, WWC1, COL23A1, AGAP1 
##     PRKD1, NEK6, HIF1A-AS3, LAMA5, TTYH3, ITGA3, ANPEP, VAV2, FBXO17, ASTN2 
##     RHPN2, EML6, MAST4, ANGPTL4, COL27A1, EVA1A, NRP1, KCNK3, MAML2, RFX2 
## Negative:  FKBP5, CHST11, IRS2, ABTB2, PPM1L, PCED1B, ACSL1, SQSTM1, EFNA5, ESYT2 
##     CYP3A5, ADAMTS9-AS2, DOCK4, FBXO25, PDK4, SORT1, ZSWIM6, TLR2, GFPT1, UTRN 
##     IGFN1, GBE1, CACNA2D1, PDE8A, NAP1L1, PHIP, AGPAT4, ENSG00000291147, DCBLD2, CARD11 
## PC_ 3 
## Positive:  TCF4, LDB2, SHANK3, MGAT4A, EBF1, PECAM1, BTNL9, IQSEC1, PTPRB, PBX1 
##     GNA14, PRKCH, ENSG00000284686, NOTCH4, MEF2C, ADGRL4, ADGRF5, PLPP1, ARHGEF3, ENSG00000289873 
##     RAPGEF5, FMNL3, SPRY1, RASGRF2, ENSG00000286508, SH3BP5, ENSG00000248636, KALRN, FLT4, HECW2 
## Negative:  EFNA5, SQSTM1, TRIO, WWC1, DCBLD2, IGFN1, PRKCA, NHS, AKAP12, HAVCR1 
##     RUNX2, GALNT14, ABCC4, PLEKHA5, ITGA3, SRGAP1, TUSC3, LINC01322, NAP1L1, FRMD5 
##     KIF21A, UGCG, NDRG1, CARMIL1, CD70, ME1, ABCC2, PTPRJ, CDR2, TPM1 
## PC_ 4 
## Positive:  MAT2A, ANXA2, SNHG1, ADM, TGM2, ENO1, CDCA8, RUNX1, YWHAQ, HCFC1 
##     CENPE, SPSB1, ABTB2, KIF14, DTL, SNAPC4, CRY1, HSP90AA1, RFWD3, KMT2D 
##     LMNB1, KLF10, PPP2CA, GAK, CCT2, EEF2, ATP13A3, TCOF1, ARRDC3, NAMPT 
## Negative:  LINC00472, PLEKHA5, TMEM178B, ENSG00000237813, DERA, SSH2, EPB41L4A, FLG-AS1, MBD5, FRMD5 
##     RGS20, MYLK, HMGA1P4, VWA8, PDGFC, FGD6, LINC00278, TIMD4, EXT1, CRPPA 
##     MAPK10, ENSG00000241962, ARHGAP29, LINC02646, RPS6KA2, LINC00342, SLC4A4, DGKH, LAMA3, XKR6 
## PC_ 5 
## Positive:  TGM2, GBE1, PPARD, PHLDB2, FMNL2, HRH1, ITGA3, EFNA5, ANGPTL4, CA12 
##     ATP13A3, PXYLP1, NDRG1, PTPRE, SPSB1, SDK1, RASA3, ANO6, ABTB2, PTPN14 
##     PLEKHA1, GLIS3, SGMS2, LINC01320, ITPR3, MAML2, NCK2, LIF, NAV2, SEMA6A 
## Negative:  ACSM2B, CDHR5, ATP1B1, ENPP7P8, OSBPL11, ENSG00000206129, WDR72, MMADHC-DT, EBNA1BP2, BBOX1 
##     CEP120, GRB2, PNPLA7, MYO7B, DPYS, FBXL3, PLEKHF2, GPX3, PRPF40A, POLR2A 
##     SLC12A6, GATM, CLIC4, ACACB, MGAT1, SLC13A1, IRF1, APOL6, LINC02532, TMEM63A

Plot of PC1 vs PC2, coloured by sample

Plot of cells and genes sorted by their PC scores for PC1 and PC2

## Warning: Requested number is larger than the number of available items (168).
## Setting to 168.

## Warning: Requested number is larger than the number of available items (168).
## Setting to 168.

Plot of top genes associated with PC1 and PC2

Dimensionality of the data

The dimensionality of the dataset to use for downstream analysis must be decided. This is the number of principal components believed to capture the majority of true biological signal in the dataset. Using too many PCs usually does not affect the result too much, while including too few PCs can be detrimental, as meaningful biological variation may be lost. Therefore, it is better to be overgenerous when defining the dimensionality of the data than to underestimate it.

The dimensionality can be decided by consulting the elbow plot (see below) and manually picking a value (set by argument n_dims) or it can be calculated using the intrinsicDimensionality package.

intrinsicDimensionality

No value was provided for n_dims, so dimensionality will be calculated using the intrinsicDimensions package.

The dimensionality of the dataset can also be estimated using the intrinsicDimension package. This bypasses the manual approach of reading the elbow plot, but can sometimes be a bit too conservative. Therefore, if n_dims is not defined by the user, the dimensionality is set as double the intrinsicDimension estimate.

n_dims_iD <- intrinsicDimension::maxLikGlobalDimEst(
    seu@reductions$pca@cell.embeddings,
    k = 10
  )$dim.est %>% round(0)
generous_n_dims <- round(2 * n_dims_iD, 0)
final_n_dims <- generous_n_dims

Elbow plot

Dimensionality set to 24.

Non-linear dimensionality reduction (UMAP, tSNE)

Plot of UMAP and tSNE, coloured by sample

Plot of all reductions, coloured by groupings

A plot of all available groupings of the data, projected onto the three dimensionality reductions (PCA, UMAP, and tSNE).

Clustering

From Clarke et al., 2021

Annotating cell states and gradients When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states. The definitions of cell type and state are not yet standardized, but a stable cell type may be expected to have homogeneous gene expression across a cluster and be compact in a 2D projection plot, whereas cell gradients appear as a spread-out string of cells and cell states (e.g., cell cycle state). Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation63, spatial patterning64 or transient developmental stages. Care must be taken to distinguish biologically meaningful cell states and experimental batch effects, which can manifest in a similar way.

We perform unsupervised transcriptome similarity-based clustering to identify groups of nuclei with relatively homogeneous transcript profiles (united by a shared celltype / state). Clustering is performed at different resolutions to explore how groupings change across the parameter space. The aim is to find a resolution at which the clusters appear to best reflect true biological subpopulations of the dataset.

This process is based on the assumption that true discrete clusters of nuclei do exist within the dataset. The preceding nucleus filtering, feature selection, and dimensionality reduction steps were taken to distill only the highest quality features that reflect the underlying structure of the population and to remove noise that distracts from this structure.

As with the final dimensionality parameter, there is no deterministic way to set the final clustering resolution and no definitive best value. Instead, one must qualitatively assess clustering outputs for the given dataset and pick a ‘good’ resolution. Typically, a ‘good’ final resolution is taken as one that groups nuclei into celltypes. However, looking at multiple clustering resolutions of the same dataset can provide multiple true and valuable conclusions about the dataset at multiple levels of functional subspecialisation of cells.

Setting a low clustering resolution will produce fewer clusters, reflect the broadest subdivisions between groups of nuclei (e.g. different tissues / differentiation lineages), and is more robust to noise, but can miss important local structure within the dataset.

Higher clustering resolutions will result in more, increasingly granular clusters and can reveal celltype-level groups. In can also reveal new / rare celltypes that are hidden at lower resolutions. Setting the clustering resolution too high, however, can produce meaningless groups and lead to overfitting.

Find neighbours

The first step is to compute the k.param nearest neighbours for a given dataset. This step constructs a shared nearest neighbours (SNN) graph from euclidean distances between nuclei in PCA space, and then computes edge weights between each pair of nuclei based on their Jaccard similarity (the shared overlap of their local neighborhoods).

seu <- Seurat::FindNeighbors(seu, dims = 1:final_n_dims)
## Computing nearest neighbor graph
## Computing SNN

Find clusters at different clustering resolutions

Then, groups of nuclei are iteratively grouped together. Clusters are identified by a SNN modularity optimisation-based clustering algorithm. This procedure is repeated for all clustering resolutions to be tested (set by the clustering_resolutions argument).

The clustering resolutions to be tested are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8.

seu <- Seurat::FindClusters(seu, resolution = clustering_resolutions, verbose = F)

Evaluate all clustering resolution

Plot of all reductions, coloured by clusters

A final clustering resolution can be decided by looking for sensible groupings of nuclei projected onto the dimensionality reduction plots.

Plot of clustering tree

Alternatively, the clustree package provides a more intuitive way to visualise clustering behaviour at different resolutions and to devide on a final clustering resolution. It constructs clustering trees that show how samples move and split up as the number of clusters increases. A ‘solid’ clustering resolution is taken as one in a part of the tree where clusters remain relatively stable and constant for a few resolution. This indicates that the groups are robust and strongly supported by the data, rather than volatile and noisy.

Choose a final clustering resolution

The final clustering resolution has been set to 0.3.

seu$cluster <- seu@meta.data[, paste0(snn_res_prefixes, final_clustering_resolution)]

# save seu_clustered
saveRDS(seu, here::here(out$base, "seu_clustered.rds"))

Plots of clusters per sample and samples per cluster

Celltype annotation

The conventional approach to celltype annotation for sc/snRNAseq datasets is to inspect known marker genes from each of the celltypes that are expected and likely to be found at the sample site. Known relationships between marker genes and celltypes of interest can be obtained from databases or manually curated from literature. Then, the expression profiles of all markers from all celltypes across all clusters is programmatically or visually inspected and clusters are then labelled according to their marker profile.

Automated annotation using database markers

The package SingleR returns “the best annotation for each cell in a test dataset, given a labelled reference dataset in the same feature space”. Using the human primary cell atlas from celldex, we can find the nearest reference celltype to each nucleus in the dataset. This approach is good because it is unbiased and does not require manual assignment, but it is unclear how well snRNAseq output will map onto a scRNAseq atlas. Additionally, the reference dataset from celldex is more rigid, cannot be customised to the use case, and does not contain transformed celltypes.

human_primary_ref <- celldex::HumanPrimaryCellAtlasData()
## snapshotDate(): 2022-10-31
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
seu_singler <- SingleR::SingleR(
  test = Seurat::GetAssayData(seu, slot = "data"),
  ref = human_primary_ref,
  labels = human_primary_ref$label.main
)

Heatmap of celltype annotation scores from SingleR

Delta distribution of celltype annotations from SingleR

## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0

Some annotations will contain very few nuclei. These are usually not of interest and clutter up plotting, so we remove annotations assigned to < 10 nuclei.

seu$singler_annot <- seu_singler %>%
      dplyr::as_tibble() %>%
      dplyr::group_by(labels) %>%
      dplyr::transmute(singler_annot = replace(labels, dplyr::n() < 10, "none")) %>%
      dplyr::pull(singler_annot)

Plot of UMAP clusters, coloured by SingleR annotation and labelled by majority SingleR annotation per cluster

Plot of SingleR annotation composition of samples

Alluvial plot of assignment paths of nuclei from samples to clusters to SingleR annotations

Manual annotation using curated literature markers

Literature markers for celltypes of interest in ccRCC tumours